Abstract

Recent studies in human vision science suggest that human responses to stimuli on a visual display
are non-deterministic: people may attend to different locations on the same visual input at the same time.
Based on this knowledge, we propose a new stochastic model of visual attention that introduces a dynamic
Bayesian network to predict the likelihood of where humans typically focus on a video scene. The proposed
model is composed of a dynamic Bayesian network with 4 layers. It provides a framework that simulates and
combines the visual saliency response and the cognitive state of a person to estimate the most probable
attended regions. Sample-based inference with a Markov chain Monte-Carlo based particle filter and stream
processing with multi-core processors enable us to estimate human visual attention in near real time.
Experimental results demonstrate that our model predicts human visual attention significantly better than
previous deterministic models.

Human visual attention, saliency, dynamic Bayesian network, state space model, hidden Markov model, Markov chain

1 Introduction

Developing sophisticated object detection and recognition algorithms has been a long-standing challenge
in computer and robot vision research. Such algorithms are required in most applications of computational
vision, including robotics [1], medical imaging [2], intelligent cars [3], surveillance [4], image
segmentation [5], [6] and content-based image retrieval [7]. One of the major challenges in designing
generic object detection and recognition systems is to construct methods that are fast and capable of
operating on standard computer platforms without any prior knowledge. To that end, a pre-selection
mechanism is essential so that subsequent processing can focus only on relevant data. One
promising approach to achieving this mechanism is visual attention: it selects regions in a visual scene that
are most likely to contain objects of interest. The field of visual attention is currently the focus of much
research for both biological and artificial systems.

Attention is generally controlled by one or a combination of two mechanisms: 1) a top-down
control that voluntarily chooses the focus of attention in a cognitive and task-dependent manner, and 2)
a bottom-up control that reflexively directs the visual focus based on the observed saliency attributes.
The first biologically plausible model for explaining the human attention system was proposed by Koch
and Ullman [8], which follows the latter approach. The basic concept underlying this model is the feature
integration theory developed by Treisman and Gelade [9], which has been one of the most influential
theories of human visual attention. According to the feature integration theory, in the first step of visual
processing, several primary visual features are processed and represented with separate feature maps,
which are later integrated into a saliency map that can be accessed in order to direct attention to the most
conspicuous areas. In the example shown in Fig. 1, a red car on the right of the frame is conspicuous,
and therefore people direct their attention to this area. The Koch-Ullman model has attracted the
attention of many researchers, especially after the development of an implementation model
by Itti, Koch and Niebur [10]. Since then, many attempts have been made to improve the Koch-Ullman
model [11], [12], [13], [14], [15] and to extend it to video signals [16], [?], [?], [18].

Although the feature integration theory explains the early human visual system well, it has one
crucial problem: it cannot account for the fact that people may attend to different locations on the same
visual input at the same time. The example shown in Fig. 1 illustrates this phenomenon: people may
pay attention to the blue traffic sign at the center, the white line at the bottom left, or other regions.
Previously, this inconsistent visual attention was considered to be caused by object-based attention rather
than location-based attention [21], which implies that inconsistent visual attention is heavily controlled by
higher-order processes such as top-down intention, knowledge and preferences. Another typical example
can be seen in Fig. 2. Consider a search task with a single 45° target among many distractors. It is
easy to see that the target is easy to find in the left case and difficult to find in the right case. However,
according to the feature integration theory, we would immediately identify the target in both the easy and
the hard search, since we always select the location where the response of the detector tuned to the target
visual property is greater than at any other location.

Another theory for understanding visual search and attention, called the signal detection theory
[?], [20], has also attracted interest. According to this theory, the elements in a visual display are
internally represented as independent random variables. Consider again the search task shown in
Fig. 2. The response of a detector tuned to the target orientation is represented as a Gaussian density. The
response of the same detector to a distractor is also a Gaussian density, with a lower mean value. For a
45° target and vertical distractors, these densities barely overlap, which implies that we can immediately
detect the target. In the hard search, on the other hand, the target density is identical to that of the easy
search, but the distractor density is shifted rightward, so that the target and distractor densities
overlap. This implies that the probability of focusing on a distractor becomes high,
and therefore it takes much longer to detect the target.
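The effect of overlapping densities can be checked numerically. The following is a minimal sketch, not
part of the original model: it assumes unit-variance Gaussian detector responses, independence across
items, and hypothetical mean values chosen only for illustration. It estimates the probability that the
target response exceeds every distractor response.

```python
import numpy as np

def prob_target_is_max(mu_target, mu_distractor, n_distractors,
                       sigma=1.0, n_trials=100_000, seed=0):
    """Monte-Carlo estimate of P(target response > every distractor response)."""
    rng = np.random.default_rng(seed)
    target = rng.normal(mu_target, sigma, size=n_trials)
    distractors = rng.normal(mu_distractor, sigma, size=(n_trials, n_distractors))
    return float(np.mean(target > distractors.max(axis=1)))

# Hypothetical means (assumption): distractor density far from the target
# density (easy search) or shifted toward it (hard search).
p_easy = prob_target_is_max(mu_target=4.0, mu_distractor=0.0, n_distractors=10)
p_hard = prob_target_is_max(mu_target=4.0, mu_distractor=3.0, n_distractors=10)
print(p_easy, p_hard)
```

With these illustrative values the target is almost always the maximum in the easy case, while in the hard
case a distractor frequently produces the largest response.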

Following the paradigm of the signal detection theory, we propose a new stochastic model of visual
attention. With this model, we can automatically predict the likelihood of where humans typically focus
on a visual input. The proposed model is composed of a dynamic Bayesian network with four layers:
1) a saliency map that shows the average saliency response at each position of a video frame, 2) a
stochastic saliency map that converts the saliency map into a natural human response through a Gaussian
state space model based on the findings of the signal detection theory, 3) an eye movement pattern that
controls the degree of "overt shifts of attention" (shifts with saccadic eye movements) through a hidden
Markov model (HMM), and 4) an eye focusing density map that predicts the positions that people are likely
to pay attention to, based on the stochastic saliency map and the eye movement patterns. When describing the
Bayesian network of visual attention, we introduce the principle of the signal detection theory, namely,
that the position at which the stochastic saliency map takes its maximum is the eye focusing position.
The proposed model also provides a framework for simulating top-down cognitive states of a person at the
layer of eye movement patterns. Introducing eye movement patterns as hidden states of an HMM
enables us to describe the mechanism of eye focusing and eye movement naturally.

The paper is organized as follows. Section 2 discusses related research on modeling human visual
attention with probabilistic techniques or concepts. Section 3 describes
the proposed stochastic model of visual attention. Section 4 presents the methods for finding maximum
likelihood (ML) estimates of the model parameters based on the Expectation-Maximization (EM) framework.
Section 5 discusses some evaluation results. Finally, Section 6 summarizes the report and discusses
future work.

2 Related work 

Several previous studies have modeled human visual attention by using probabilistic techniques or
concepts. Itti and Baldi [16] investigated a Bayesian approach to detecting
surprising events in video signals. Their approach models surprise as the Kullback-Leibler divergence
between the prior and posterior distributions of fundamental features. Avraham and Lindenbaum [22]
utilized a graphical model approximation to extend their static saliency model based on self similarities.
Boccignone [?] introduced a nonparametric Bayesian framework to achieve object-based visual attention.
Gao, Mahadevan and Vasconcelos [18], [23] developed a decision-theoretic attention model for
object detection.

April 2, 2010 DRAFT
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. XXX, NO. XXX, XXXXX 2010

[Figure 3: layers, from top to bottom: eye movement patterns (intention), eye focusing density maps
(action), stochastic saliency maps (response), (deterministic) saliency maps (stimulus), input video (input)]

Fig. 3. Graphical representation of the proposed stochastic model of human visual attention, where arrows
indicate stochastic dependencies.

The main contribution of our model over the above previous studies is a unified stochastic model that
uses a dynamic Bayesian network to integrate "covert shifts of attention" (shifts of attention without
saccadic eye movements) driven by bottom-up saliency with "overt shifts of attention" (shifts of attention
with saccadic eye movements) driven by eye movement patterns.
Our model also provides a framework that simulates and combines the bottom-up visual
saliency response and the top-down cognitive state of a person to estimate probable attended regions,
provided that the eye movement patterns can be extended to handle more sophisticated top-down information.
How to integrate such top-down information is one of the most important topics for future research.

3 Stochastic visual attention model 

3.1 Overview 

Figs. 3 and 4 illustrate the graphical representation of the proposed visual attention model. The proposed
model consists of four layers: (deterministic) saliency maps, stochastic saliency maps, eye focusing
positions and eye movement patterns. Before describing the proposed visual attention model,
let us introduce several notations and definitions.

Fig. 4. Procedure of the proposed model



$I = i(1:T) = \{i(t)\}_{t=1}^{T}$ denotes an input video, where $i(t)$ is the $t$-th frame of the video $I$
and $T$ is the duration (i.e. the total number of frames) of the video $I$. The symbol $I$ also denotes the
set of coordinates in a frame. For example, a position $y$ in a frame is represented as $y \in I$.

$\bar{S} = \bar{S}(1:T) = \{\bar{S}(t)\}_{t=1}^{T}$ denotes a saliency video, which comprises a sequence of
saliency maps $\bar{S}(t)$ obtained from the input video $I$. Each saliency map is denoted as
$\bar{S}(t) = \{\bar{s}(t,y)\}_{y \in I}$, where $\bar{s}(t,y)$ is called the saliency, i.e. the pixel value
at the position $y \in I$. Each saliency is a real value between 0 and 1 that represents the strength of
the visual stimulus at the corresponding position of a frame.

$S = S(1:T) = \{S(t)\}_{t=1}^{T}$ denotes a stochastic saliency video, which comprises a sequence of
stochastic saliency maps $S(t)$ obtained from the input video $I$. Each stochastic saliency map is denoted
as $S(t) = \{s(t,y)\}_{y \in I}$, where $s(t,y)$ is called the stochastic saliency, i.e. the pixel value at
the position $y \in I$. Each stochastic saliency corresponds to the saliency response perceived through a
certain kind of random process.

$U = u(1:T) = \{u(t)\}_{t=1}^{T}$ denotes a sequence of eye movement patterns, each of which represents a
pattern of eye movements. Previous research [24] implies that there are two typical patterns^1 of eye
movements when one is simply watching a video: 1) a passive state $u(t) = 0$, in which one tends to stay
around one particular position to continuously capture important visual information, and 2) an active state
$u(t) = 1$, in which one actively moves around and searches for various visual information in the scene. Eye
movement patterns may reflect the purposes or intentions of human eye movements.

1. Peters and Itti [24] defined another pattern, the interactive state, which can be seen when playing video
games, driving a car or browsing the web. We omit the interactive state since our setting in this paper does
not include any interaction.

$X = x(1:T) = \{x(t)\}_{t=1}^{T}$ denotes a sequence of eye focusing positions. The proposed model estimates
the eye focusing position by integrating the bottom-up information (stochastic saliency maps) and the
top-down information (eye movement patterns). A map that represents a density of eye focusing positions
is called an eye focusing density map.

Only the saliency maps are observed, and therefore the eye focusing positions must be estimated under
the condition that the other layers (stochastic saliency maps and eye movement patterns) are hidden.

In what follows, we denote a probability density function (PDF) of $x$ as $p(x)$, a conditional PDF of
$x$ given $y$ as $p(x|y)$, and a PDF of $x$ with a parameter $\theta$ as $p(x;\theta)$.

The rest of this section describes the details of the proposed stochastic model and the method for
estimating eye focusing positions from input videos alone.

3.2 Saliency maps 

We used the Itti-Koch saliency model [10] shown in Fig. 5 to extract (deterministic) saliency maps. Our
implementation includes twelve feature channels sensitive to color contrast (red/green and blue/yellow),
temporal luminance flicker, luminance contrast, four orientations (0°, 45°, 90° and 135°), and two oriented
motion energies (horizontal and vertical). These features detect spatial outliers in image space using a
center-surround architecture. Center and surround scales are obtained from dyadic pyramids with 9
scales, from scale 0 (the original image) to scale 8 (the image reduced by a factor of $2^8 = 256$ in both the
horizontal and vertical dimensions). Six center-surround difference maps are then computed as point-wise
differences across pyramid scales, for combinations of three center scales ($c = \{2,3,4\}$) and two
center-surround scale differences ($s = \{3,4\}$). Each feature map is additionally endowed with internal
dynamics that provide a strong spatial within-feature and within-scale competition for activity, followed
by within-feature, across-scale competition. In this way, initially noisy feature maps can be reduced to
sparse representations of only outlier locations which stand out from their surroundings. All feature maps
finally contribute to a unique saliency map representing the conspicuity of each location in the visual
field. The saliency map is adjusted with a centrally-weighted 'retinal' filter, which puts higher emphasis
on the saliency values around the center of the video.
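The center-surround computation across pyramid scales can be sketched as follows. This is an illustrative
simplification, not the implementation used in this paper: it assumes a single-channel feature image, uses
2 x 2 block averaging in place of the low-pass filtered pyramid of the Itti-Koch model, and upsamples the
surround by nearest-neighbour replication. The function names are hypothetical.

```python
import numpy as np

def dyadic_pyramid(image, n_scales=9):
    """Scales 0..8: each level halves the resolution by 2x2 block averaging
    (assumption: a stand-in for the filtered pyramid of the original model)."""
    levels = [image.astype(float)]
    for _ in range(n_scales - 1):
        prev = levels[-1]
        h, w = prev.shape[0] // 2, prev.shape[1] // 2
        levels.append(prev[:2 * h, :2 * w].reshape(h, 2, w, 2).mean(axis=(1, 3)))
    return levels

def center_surround_maps(image):
    """Six |center - surround| maps for center scales c in {2, 3, 4}
    and center-surround scale differences in {3, 4}."""
    pyr = dyadic_pyramid(image)
    maps = {}
    for c in (2, 3, 4):
        for diff in (3, 4):
            surround_scale = c + diff
            factor = 2 ** diff
            # Nearest-neighbour upsampling of the surround level to the center scale.
            surround = np.kron(pyr[surround_scale], np.ones((factor, factor)))
            # Crop the center level in case the image size is not a multiple of 256.
            center = pyr[c][:surround.shape[0], :surround.shape[1]]
            maps[(c, surround_scale)] = np.abs(center - surround)
    return maps
```

Calling `center_surround_maps` on a 256 x 256 feature image returns six maps keyed by (center scale,
surround scale); a uniform image yields maps that are zero everywhere, since no location stands out.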

3.3 Stochastic saliency maps 

When estimating a stochastic saliency map $S(t) = \{s(t,y)\}_{y \in I}$, we introduce a pixel-wise state
space model characterized by the following two relationships:

$$p(\bar{s}(t,y) \mid s(t,y)) = \mathcal{N}(s(t,y), \sigma_{s1}),$$

$$p(s(t,y) \mid s(t-1,y)) = \mathcal{N}(s(t-1,y), \sigma_{s2}),$$

where $\mathcal{N}(\mu, \sigma)$ is the Gaussian PDF with mean $\mu$ and variance $\sigma^2$. The first
equation implies that a saliency map is observed through a Gaussian random process, and the second equation
exploits the temporal characteristics of the human visual system. For brevity, only in this section we
omit the position $y$ where an explicit expression is unnecessary, e.g. $s(t)$ instead of $s(t,y)$.

Fig. 5. Saliency map extraction using Itti et al.'s model

We employ a Kalman filter to recursively compute the stochastic saliency map. Assume that the density
at each position of the stochastic saliency map $s(t-1)$ at time $t-1$, given the saliency maps
$\bar{s}(1:t-1)$ up to time $t-1$, is the following Gaussian PDF:

$$p(s(t-1) \mid \bar{s}(1:t-1)) = \mathcal{N}(\hat{s}(t-1|t-1), \sigma_s(t-1|t-1)),$$

where the position $y$ is omitted for simplicity. Then, the density $p(s(t) \mid \bar{s}(1:t))$ of the
stochastic saliency map at time $t$ is updated by the following recurrence relations with the saliency maps
$\bar{s}(1:t)$ up to time $t$.

[Estimation step]

$$p(s(t) \mid \bar{s}(1:t-1)) = \mathcal{N}(\hat{s}(t|t-1), \sigma_s(t|t-1)),$$

where

$$\hat{s}(t|t-1) = \hat{s}(t-1|t-1),$$

$$\sigma_s^2(t|t-1) = \sigma_{s2}^2 + \sigma_s^2(t-1|t-1).$$

[Update step]

$$p(s(t) \mid \bar{s}(1:t)) = \mathcal{N}(\hat{s}(t|t), \sigma_s(t|t)),$$

where

$$\hat{s}(t|t) = \frac{\sigma_{s1}^2}{\sigma_{s1}^2 + \sigma_s^2(t|t-1)} \hat{s}(t|t-1) + \frac{\sigma_s^2(t|t-1)}{\sigma_{s1}^2 + \sigma_s^2(t|t-1)} \bar{s}(t), \tag{1}$$

$$\sigma_s^2(t|t) = \frac{\sigma_{s1}^2 \, \sigma_s^2(t|t-1)}{\sigma_{s1}^2 + \sigma_s^2(t|t-1)}. \tag{2}$$


Remark 1: The above model assumes that the model parameters $(\sigma_{s1}, \sigma_{s2})$ of every Gaussian
random variable are independent of the frame index $t$ and the position $y$. The model can easily be extended
to adaptive model parameters that depend on the frame index and the position. In this case, the model
parameters can be updated via on-line learning with adaptive Kalman filters (e.g. [25], [26]). (Remark 1 ends.)
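The pixel-wise Kalman recursion with fixed parameters can be sketched as follows. This is a minimal
illustration rather than the authors' implementation: the array layout, the function name and the prior
(mean equal to the first saliency map, variance `var0`) are assumptions made for the example.

```python
import numpy as np

def stochastic_saliency_filter(saliency, sigma_s1, sigma_s2, var0=1.0):
    """Pixel-wise scalar Kalman filter.

    saliency: array of shape (T, H, W) holding the deterministic saliency maps.
    Returns the posterior means and variances of the stochastic saliency,
    each of shape (T, H, W).
    """
    mean = saliency[0].astype(float)            # assumed prior mean
    var = np.full(saliency.shape[1:], var0)     # assumed prior variance
    means = np.empty(saliency.shape)
    variances = np.empty(saliency.shape)
    for t in range(saliency.shape[0]):
        # Estimation step: the mean is carried over, the variance grows.
        mean_pred = mean
        var_pred = var + sigma_s2 ** 2
        # Update step: blend the prediction with the observed saliency map.
        gain = var_pred / (var_pred + sigma_s1 ** 2)
        mean = mean_pred + gain * (saliency[t] - mean_pred)
        var = (1.0 - gain) * var_pred
        means[t], variances[t] = mean, var
    return means, variances
```

Because every pixel is processed independently with the same scalar recursion, the loop body maps directly
onto element-wise array operations, which is what makes the step suitable for stream processing.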



3.4 Estimating eye motions 

3.4.1 Overview

By incorporating the stochastic saliency map $S(t) = \{s(t,y)\}_{y \in I}$ and the eye movement pattern
$u(t)$, we introduce the following transition PDF to estimate the eye focusing position $x(t)$:

$$p(x(t), u(t) \mid p(S(t)), x(t-1), u(t-1)) \propto p(x(t) \mid p(S(t))) \cdot p(u(t) \mid u(t-1)) \cdot p(x(t) \mid x(t-1), u(t)), \tag{3}$$

where the PDF of the stochastic saliency map at time $t$ is represented as $p(S(t))$ for simplicity, namely

$$p(s(t,y)) = p(s(t,y) \mid \bar{s}(1:t,y)) \quad \forall y \in I.$$

The stochastic saliency map $S(t)$ controls "covert shifts of attention" through the PDF
$p(x(t) \mid p(S(t)))$^2. On the other hand, the eye movement pattern $u(t)$ controls the degree of "overt
shifts of attention". In what follows, we call the pair $z(t) = (x(t), u(t))$, consisting of an eye focusing
position and an eye movement pattern, the eye focusing state for brevity. The following PDF of eye focusing
positions $x(t)$ given the PDF $p(S(1:t))$ of stochastic saliency maps up to time $t$ characterizes the eye
focusing density map at time $t$:

$$p(x(t) \mid p(S(1:t))) = \sum_{u(t)=0,1} p(z(t) \mid p(S(1:t))), \tag{4}$$

$$p(z(t) \mid p(S(1:t))) = \int_{z(t-1)} p(z(t-1) \mid p(S(1:t-1))) \cdot p(z(t) \mid p(S(t)), z(t-1)) \, dz(t-1). \tag{5}$$

Since Eq. (5) cannot be computed in closed form, we instead introduce a technique inspired by a
particle filter with Markov chain Monte-Carlo (MCMC) sampling. The PDF of eye focusing states
shown in Eq. (5) can be approximated by samples of eye focusing states $\{z_n(t)\}_{n=1}^{N}$ and the
associated weights $\{w_n(t)\}_{n=1}^{N}$ as

$$p(z(t) \mid p(S(1:t))) \approx \sum_{n=1}^{N} w_n(t) \cdot \delta(z(t), z_n(t)), \tag{6}$$

where $N$ is the number of samples and $\delta(\cdot,\cdot)$ represents the Kronecker delta.

Fig. 6 shows the procedure for estimating eye focusing density maps, which can be separated into
three steps: 1) generating samples from the PDFs $p(u(t) \mid u(t-1))$ and $p(x(t) \mid x(t-1), u(t))$
derived from an eye movement pattern, 2) weighting the samples with the PDF $p(x(t) \mid p(S(t)))$ derived
from a stochastic saliency map, and 3) re-sampling if necessary. We now describe each step in detail.

3.4.2 Propagation with eye movement patterns

The second and third terms of Equation (3) suggest that the current eye focusing position depends on
the previous eye focusing position, and that the degree of eye movement is driven by one's eye movement
pattern $u(t)$.

2. The notation $p(x(t) \mid p(S(t)))$ may seem unusual. However, the PDF of eye focusing positions $x(t)$
estimated from the stochastic saliency map $S(t)$ is determined by the PDF of the stochastic saliency map,
not by the stochastic saliency map itself, as shown in Section 3.4.3.

Fig. 6. Strategy for calculating eye focusing density maps



The second term $p(u(t) \mid u(t-1))$ of Equation (3) is characterized by the transition probability
$\Phi = \{\phi_{(i,j)}\}$ of eye movement patterns, defined by a $2 \times 2$ matrix given in advance:

$$p(u(t) \mid u(t-1)) = \phi_{(u(t-1),\,u(t))}. \tag{7}$$

The third term $p(x(t) \mid x(t-1), u(t))$ of Equation (3) represents the transition PDF of eye focusing
positions governed by the eye movement pattern at the current time (see Figure 7), defined as

$$p(x(t) \mid x(t-1), u(t)) = \mathcal{L}(x(t-1), \gamma_{x,u(t)}, \sigma_{x,u(t)}), \tag{8}$$

where $\gamma_{xi}$ and $\sigma_{xi}$ ($i = 0, 1$) are model parameters that represent the average and the
standard deviation of the distances of eye movements, and $\mathcal{L}(\bar{x}, \gamma, \sigma)$ is a shifted
2D Gaussian PDF with mean $\bar{x}$, indent $\gamma$ and variance $\sigma^2$ such that

$$\mathcal{L}(\bar{x}, \gamma, \sigma) \propto \exp\left\{ -\frac{(\|x - \bar{x}\| - \gamma)^2}{2\sigma^2} \right\}.$$

Fig. 7. Transition PDF of eye focusing positions governed by the eye movement patterns

Samples $\{z_n(t)\}_{n=1}^{N}$ of eye focusing states are generated with MCMC sampling. Suppose
that samples $\{z_n(t-1)\}_{n=1}^{N}$ of eye focusing states at time $t-1$ have already been obtained. Then,
samples $\{z_n(t)\}_{n=1}^{N}$ at time $t$ are drawn by using the second and third terms of Equation (3) with
the Metropolis algorithm [27] such that

$$z_n(t) = (x_n(t), u_n(t)),$$

$$u_n(t) \sim p(u(t) \mid u_n(t-1)),$$

$$x_n(t) \sim p(x(t) \mid x_n(t-1), u_n(t)),$$

where $x \sim p(x)$ indicates that a sample $x$ is drawn from a PDF $p(x)$. This top-down part corresponds to
the propagation step of a particle filter.
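The propagation step can be sketched as follows. This is an illustrative sketch under stated assumptions,
not the authors' implementation: the index order `phi[i, j] = P(u(t)=j | u(t-1)=i)`, the random-walk
proposal width `step`, the number of Metropolis iterations and the starting point of each chain are all
choices made for the example.

```python
import numpy as np

def propagate(x_prev, u_prev, phi, gamma, sigma, rng, n_metropolis=20, step=10.0):
    """Draw u_n(t) and x_n(t) for all particles.

    x_prev: (N, 2) float positions, u_prev: (N,) integer patterns in {0, 1}.
    phi: (2, 2) transition matrix (assumed index order: from state, to state).
    gamma, sigma: length-2 arrays indexed by the eye movement pattern.
    """
    n = len(u_prev)
    # u_n(t) ~ p(u(t) | u_n(t-1)): move to state 1 with probability phi[u_prev, 1].
    u = (rng.random(n) < phi[u_prev, 1]).astype(int)
    g, s = gamma[u], sigma[u]

    def log_density(x):
        # Unnormalized log of the shifted 2D Gaussian around the previous position.
        d = np.linalg.norm(x - x_prev, axis=1)
        return -0.5 * ((d - g) / s) ** 2

    # Metropolis sampling with a symmetric Gaussian random-walk proposal.
    x = x_prev + g[:, None] * np.array([1.0, 0.0])   # start on the ring of radius gamma
    logp = log_density(x)
    for _ in range(n_metropolis):
        cand = x + rng.normal(0.0, step, size=x.shape)
        logp_cand = log_density(cand)
        accept = np.log(rng.random(n)) < logp_cand - logp
        x[accept], logp[accept] = cand[accept], logp_cand[accept]
    return x, u
```

A short chain started from a fixed direction is biased toward that direction; in practice the number of
iterations and the proposal width would need tuning so that samples spread around the whole ring.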

3.4.3 Updating with stochastic saliency maps

As the second step, the sample weights $\{w_n(t)\}_{n=1}^{N}$ are updated based on the first term
$p(x(t) \mid p(S(t)))$ of Equation (3). Formally, the weight $w_n(t)$ of the $n$-th sample $z_n(t)$ at time
$t$ is calculated as

$$w_n(t) \propto w_n(t-1) \cdot p(x(t) = x_n(t) \mid p(S(t))).$$

As shown in Equation (6), the samples $\{z_n(t)\}_{n=1}^{N}$ of eye focusing states and the associated
weights $\{w_n(t)\}_{n=1}^{N}$ comprise the eye focusing density map at time $t$. This step corresponds to the
update step of a particle filter.

The first term of Equation (3) reflects the fact that the eye focusing position is selected based on
the signal detection theory, where the position at which the stochastic saliency takes its maximum is
determined as the eye focusing position. In other words, this term computes, for each position, the
probability that the stochastic saliency takes its maximum there, which can be calculated as

$$p(x(t) \mid p(S(t))) = \int_{-\infty}^{\infty} p(s(t,x(t)) = s) \prod_{\tilde{x} \neq x(t)} P(s(t,\tilde{x}) < s) \, ds, \tag{9}$$

where $P(s(t,y) < s)$ is the cumulative distribution function (CDF) that corresponds to the PDF $p(s(t,y))$
of the stochastic saliency $s(t,y)$. The first part of Equation (9) stands for the probability that the
stochastic saliency value at position $x(t)$ equals $s$, and the second part represents the probability
that the stochastic saliency values at all other positions are smaller than $s$.

Direct computation of Equation (9) is intractable. Instead, we introduce an alternative expression of
Equation (9) that is suited to stream processing with multi-core processors:

$$p(x(t) \mid p(S(t))) = \int_{-\infty}^{\infty} \frac{p(s(t,x(t)) = s)}{P(s(t,x(t)) < s)} \prod_{\tilde{x} \in I} P(s(t,\tilde{x}) < s) \, ds. \tag{10}$$

The latter part of Eq. (10), the product over all positions, does not depend on the position $x(t)$, which
implies that it can be calculated in advance for every $s$. This calculation can be executed in
$O(\log |I|)$ time through a tree-based multiplication with parallelization at each level (cf. Fig. 8).
Also, the former part of Eq. (10) can be calculated independently for each position $x(t)$. Therefore, once
the calculation of the latter part has finished, Eq. (10) can be calculated in $O(\log |S|)$ time with a
combination of tree-based addition and pixel-wise parallelization,
where $|S|$ stands for the resolution of the integral in Eq. (10).
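The factorized form of the maximum-position probability can be sketched numerically as follows. This is a
serial NumPy illustration, not the tree-based parallel scheme described above: the integration grid, the
number of levels `n_levels`, the log-domain product and the final normalization are assumptions made for
the example, and the function name is hypothetical.

```python
import numpy as np
from math import erf, sqrt, pi

_erf = np.vectorize(erf)

def max_position_probability(mean, std, n_levels=512):
    """mean, std: arrays of shape (P,) describing the Gaussian PDF of the
    stochastic saliency at each of P positions.
    Returns, for each position, the probability that it holds the maximum."""
    lo = (mean - 4 * std).min()
    hi = (mean + 4 * std).max()
    s = np.linspace(lo, hi, n_levels)                     # integration grid over s
    z = (s[None, :] - mean[:, None]) / std[:, None]       # shape (P, n_levels)
    pdf = np.exp(-0.5 * z ** 2) / (std[:, None] * sqrt(2 * pi))
    log_cdf = np.log(np.clip(0.5 * (1 + _erf(z / sqrt(2))), 1e-300, 1.0))
    # Product of the CDFs over all positions: shared by every position,
    # so it is computed once per level s (as a sum of logs).
    log_prod_all = log_cdf.sum(axis=0)
    # Per-position ratio pdf / cdf combined with the shared product.
    integrand = pdf * np.exp(log_prod_all[None, :] - log_cdf)
    prob = integrand.sum(axis=1) * (s[1] - s[0])
    return prob / prob.sum()
```

For example, with equal standard deviations the position with the largest mean receives the largest
probability, and two positions with identical densities receive equal probabilities.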



3.4.4 Re-sampling 

Finally, re-sampling is performed to eliminate samples with low importance weights and to multiply samples
with high importance weights. This step enables us to avoid the "degeneracy" problem, namely, the
situation where all but one of the importance weights are close to zero. Although the effective number
of samples [28] is frequently used as a criterion for re-sampling, we execute re-sampling at regular time
intervals.
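Both ingredients mentioned above can be sketched in a few lines. This is a minimal illustration that
assumes multinomial re-sampling; the text does not state which re-sampling scheme was used, and the function
names are hypothetical.

```python
import numpy as np

def effective_sample_size(weights):
    """Effective number of samples, 1 / sum(w_n^2) for normalized weights."""
    w = weights / weights.sum()
    return 1.0 / np.sum(w ** 2)

def resample(samples, weights, rng):
    """Multinomial re-sampling (assumption): draw N indices with probability
    proportional to the weights and reset the weights to uniform."""
    n = len(weights)
    w = weights / weights.sum()
    idx = rng.choice(n, size=n, p=w)
    return samples[idx], np.full(n, 1.0 / n)
```

The effective number of samples equals N for uniform weights and approaches 1 when a single weight
dominates, which is the degenerate situation that re-sampling is meant to prevent.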

Remark 2: Note that the whole procedure for estimating eye focusing density maps, which includes the
propagation, updating and re-sampling steps, is equivalent to a particle filter with MCMC
sampling, since the PDFs used in the propagation and update steps are mutually independent.
(Remark 2 ends.)

4 Model parameter estimation

This section focuses on the problem of estimating maximum likelihood (ML) model parameters. Fig. 9
shows the block diagram of our model parameter estimation. We can automatically estimate almost all
the model parameters in advance by using, as observations, saliency maps calculated from the input video
and eye focusing positions obtained with an eye tracking device. Simultaneous estimation of
all ML parameters would be optimal but is impractical due to the substantial computational cost. Therefore,
we separate the parameter estimation into two independent stages. The first stage derives the parameters for
computing stochastic saliency maps, and the second stage derives the parameters for estimating eye focusing
positions.

Fig. 9. Block diagram of our model parameter estimation



4.1 Parameters for stochastic saliency maps 

The first stage derives the parameters for computing stochastic saliency maps. Here, we introduce the EM
algorithm. In this case, the observations are the saliency maps $\bar{S} = \bar{S}(1:T)$ and the hidden
variables are the stochastic saliency maps $S = S(1:T)$. Recall that $T$ is the duration of the video. The EM
algorithm for estimating $\theta_s = (\sigma_{s1}, \sigma_{s2})$ is as follows:

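$(k+1)$-th E step

The E step updates the PDF $p(S \mid \bar{S}; \theta_{s,k})$ of the stochastic saliency maps $S$ given the
saliency maps $\bar{S}$ with the previously estimated parameter $\theta_{s,k} = (\sigma_{s1,k}, \sigma_{s2,k})$
by using a Kalman smoother. In detail, the objective is to recursively compute the mean $\hat{s}_k(t|T)$ and
the standard deviation $\sigma_{s,k}(t|T)$ of the stochastic saliency $s(t)$ at time $t = 1, 2, \cdots, T$,
where all the saliency maps $\bar{S}$ are used as observations. Note that
the position $y$ is omitted for simplicity.

Suppose that the PDF of the stochastic saliency at time $t+1$ is given by the following Gaussian PDF:

$$p(s(t+1) \mid \bar{S}; \theta_{s,k}) = \mathcal{N}(\hat{s}_k(t+1|T), \sigma_{s,k}(t+1|T)).$$

Then, the PDF of the stochastic saliency at time $t$ is obtained by the following recurrence relation:

$$p(s(t) \mid \bar{S}; \theta_{s,k}) = \mathcal{N}(\hat{s}_k(t|T), \sigma_{s,k}(t|T)),$$

where, with the smoother gain $J_k(t)$,

$$\hat{s}_k(t|T) = \hat{s}_k(t|t) + J_k(t) \left( \hat{s}_k(t+1|T) - \hat{s}_k(t|t) \right),$$

$$\sigma_{s,k}^2(t|T) = \sigma_{s,k}^2(t|t) + J_k(t)^2 \left( \sigma_{s,k}^2(t+1|T) - \sigma_{s,k}^2(t|t) - \sigma_{s2,k}^2 \right),$$

$$J_k(t) = \frac{\sigma_{s,k}^2(t|t)}{\sigma_{s,k}^2(t|t) + \sigma_{s2,k}^2},$$

and $\hat{s}_k(t|t)$ and $\sigma_{s,k}^2(t|t)$ can be obtained by Eqs. (1) and (2) with the parameter
$\theta_{s,k}$.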

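$(k+1)$-th M step

The M step updates the parameter $\theta_s$ to maximize the expected log likelihood of the PDF
$p(S, \bar{S}; \theta_s)$. We can derive a new parameter $\theta_{s,k+1}$ from the result of the E step by
taking the derivatives of the expected log likelihood with respect to $\theta_s$ and setting them to 0:

$$\sigma_{s1,k+1}^2 = \frac{1}{T} \sum_{t=1}^{T} \left[ (\bar{s}(t) - \hat{s}_k(t|T))^2 + \sigma_{s,k}^2(t|T) \right],$$

$$\sigma_{s2,k+1}^2 = \frac{1}{T-1} \sum_{t=2}^{T} \left[ (\hat{s}_k(t|T) - \hat{s}_k(t-1|T))^2 + \sigma_{s,k}^2(t-1|T) + \frac{\sigma_{s2,k}^2 - \sigma_{s,k}^2(t-1|t-1)}{\sigma_{s2,k}^2 + \sigma_{s,k}^2(t-1|t-1)} \, \sigma_{s,k}^2(t|T) \right].$$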
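One EM iteration for this scalar state space model can be sketched as follows. This is an illustrative
sketch, not the authors' implementation: it handles a single position, assumes a prior with mean equal to
the first observation and variance `var0`, and uses the standard forward filter, backward fixed-interval
smoother and lag-one covariance for a random-walk model. The function name is hypothetical.

```python
import numpy as np

def em_step(obs, sig1_sq, sig2_sq, var0=1.0):
    """One EM iteration for a 1D sequence obs (saliency at a single position).
    sig1_sq, sig2_sq: current variances sigma_s1^2 and sigma_s2^2."""
    T = len(obs)
    m_f, v_f = np.empty(T), np.empty(T)        # filtered mean / variance
    v_p = np.empty(T)                          # predicted variance
    m, v = obs[0], var0                        # assumed prior
    for t in range(T):                         # forward pass (Kalman filter)
        v_p[t] = v + sig2_sq
        gain = v_p[t] / (v_p[t] + sig1_sq)
        m = m + gain * (obs[t] - m)
        v = (1 - gain) * v_p[t]
        m_f[t], v_f[t] = m, v
    m_s, v_s = m_f.copy(), v_f.copy()          # smoothed mean / variance
    J = np.zeros(T)
    for t in range(T - 2, -1, -1):             # backward pass (smoother)
        J[t] = v_f[t] / v_p[t + 1]
        m_s[t] = m_f[t] + J[t] * (m_s[t + 1] - m_f[t])
        v_s[t] = v_f[t] + J[t] ** 2 * (v_s[t + 1] - v_p[t + 1])
    cov = J[:-1] * v_s[1:]                     # lag-one covariance of s(t), s(t-1)
    # M step: expected squared residuals under the smoothed posterior.
    new_sig1_sq = np.mean((obs - m_s) ** 2 + v_s)
    new_sig2_sq = np.mean((m_s[1:] - m_s[:-1]) ** 2 + v_s[1:] + v_s[:-1] - 2 * cov)
    return new_sig1_sq, new_sig2_sq
```

Repeated calls, feeding each result back in as the next starting point, give the EM iteration; how the
statistics are pooled over positions is not specified here and would have to be decided separately.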

4.2 Parameters for eye focusing positions 

The second stage derives the parameters $\theta_x = (\gamma_{x0}, \sigma_{x0}, \gamma_{x1}, \sigma_{x1})$ for
computing eye focusing positions. The observations are the sequence of eye focusing positions $X = x(1:T)$
obtained from an eye tracking device, and the hidden states are the eye movement patterns $U = u(1:T)$. In
this section, we introduce an alternative notation for eye movement patterns,
$u(t) = (u(t)_0, u(t)_1)^{\top}$, which is a 2-dimensional binary vector such that $(1,0)^{\top}$ denotes the
passive state and $(0,1)^{\top}$ represents the active state.

We take a Viterbi learning approach for its quick convergence. It iteratively updates the eye movement
patterns $U = u(1:T)$ and the ML parameter set $\theta_x$ to maximize the posterior $p(U \mid X; \theta_x)$.

Initializing eye movement patterns

We start by determining an initial sequence $U_0 = u_0(1:T)$ of eye movement patterns. We
introduce the following decision rule:

$$u_0(t) = \begin{cases} 0 & \text{if } \|x(t) - x(t-1)\| < \kappa_x, \\ 1 & \text{if } \|x(t) - x(t-1)\| \ge \kappa_x, \end{cases}$$

where $\kappa_x$ is a given threshold.
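The decision rule amounts to thresholding the frame-to-frame displacement of the eye focusing position. A
minimal sketch, assuming positions are stored as a (T, 2) array and that the pattern is defined from the
second frame onward (the function name is hypothetical):

```python
import numpy as np

def init_patterns(positions, kappa_x):
    """positions: (T, 2) eye focusing positions.
    Returns the initial patterns for t = 2..T: 0 (passive) or 1 (active)."""
    dist = np.linalg.norm(np.diff(positions, axis=0), axis=1)
    return (dist >= kappa_x).astype(int)
```

For instance, a gaze that stays near one point and then jumps by 49 pixels yields the patterns 0, 0, 1
with a threshold of 10 pixels.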
The $(k+1)$-th step for updating hidden variables

This step updates the sequence $U$ of eye movement patterns to maximize the posterior density
$p(U \mid X; \theta_{x,k})$ given the parameter set $\theta_{x,k}$ obtained in the previous step:

$$U_{k+1} = \arg\max_{U} p(U \mid X; \theta_{x,k}).$$

The Viterbi algorithm (e.g. [29], [30]) can derive the ML sequence $U_{k+1}$ of eye movement patterns.

The $(k+1)$-th step for updating the parameter set

This step updates the parameter set $\theta_x$ to maximize the posterior density
$p(U_{k+1} \mid X; \theta_x)$:

$$\theta_{x,k+1} = \arg\max_{\theta_x} p(U_{k+1} \mid X; \theta_x).$$

Taking the derivative of the log likelihood with respect to $\theta_x$, we obtain

$$\gamma_{xi,k+1} = \frac{\sum_{t=2}^{T} \|x(t) - x(t-1)\| \, u_{k+1}(t)_i}{\sum_{t=2}^{T} u_{k+1}(t)_i},$$

$$\sigma_{xi,k+1}^2 = \frac{\sum_{t=2}^{T} \left( \|x(t) - x(t-1)\| - \gamma_{xi,k} \right)^2 u_{k+1}(t)_i}{\sum_{t=2}^{T} u_{k+1}(t)_i}.$$
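The two alternating steps can be sketched as follows. This is an illustrative sketch under stated
assumptions, not the authors' implementation: the emission density is approximated by a 1D Gaussian on the
displacement distance (the normalization of the shifted 2D Gaussian is not given here), the initial state
is taken as uniform, and `log_phi[i, j]` is assumed to hold the log probability of moving from pattern i to
pattern j. The function names are hypothetical.

```python
import numpy as np

def viterbi(dist, log_phi, gamma, sigma):
    """Most likely pattern sequence for distances dist[t] = ||x(t) - x(t-1)||."""
    # Log emission: 1D Gaussian on the distance (assumption, see the text above).
    log_e = (-0.5 * ((dist[:, None] - gamma) / sigma) ** 2
             - np.log(sigma) - 0.5 * np.log(2 * np.pi))     # shape (n, 2)
    n = len(dist)
    score = np.empty((n, 2))
    back = np.zeros((n, 2), dtype=int)
    score[0] = np.log(0.5) + log_e[0]                        # uniform initial state
    for t in range(1, n):
        cand = score[t - 1][:, None] + log_phi               # cand[i, j]: from i to j
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + log_e[t]
    u = np.empty(n, dtype=int)
    u[-1] = score[-1].argmax()
    for t in range(n - 1, 0, -1):                            # backtracking
        u[t - 1] = back[t, u[t]]
    return u

def update_parameters(dist, u, gamma, sigma):
    """Closed-form update of (gamma_i, sigma_i) from the decoded patterns.
    The spread is measured around the previous gamma_i, following the update
    formula given above."""
    new_gamma, new_sigma = gamma.copy(), sigma.copy()
    for i in (0, 1):
        mask = u == i
        if mask.any():
            new_gamma[i] = dist[mask].mean()
            new_sigma[i] = max(np.sqrt(np.mean((dist[mask] - gamma[i]) ** 2)), 1e-6)
    return new_gamma, new_sigma
```

Alternating `viterbi` and `update_parameters` until the decoded sequence stops changing gives the Viterbi
learning loop; the lower bound on the standard deviation only guards against a collapsed state.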

5 Evaluation 

5.1 Evaluation conditions 

For the accuracy evaluation, we used the CRCNS eye-1 database created by the University of Southern
California. This database includes 100 video clips (MPEG-1, 640 x 480 pixels, 30 fps) and eye traces
recorded while showing these video clips to 8 human subjects (4-6 available eye traces for each video clip,
240 fps). Other details of the database can be found at
https://crcns.org/files/data/eye-1/crcns-eye1-summary.pdf. In this evaluation,
we used the 50 video clips (about 25 minutes in total) called the "original experiment" and the associated
eye traces.

Model parameters were derived in advance with the learning algorithm presented in Section 4. Here,
we used 5-fold cross validation, so that 40 video clips and the associated eye traces were used as
training data for evaluating the remaining data (10 videos and the associated eye traces).

All the algorithms were implemented on a standard C++ platform with NVIDIA CUDA, and the
evaluation was carried out on a standard PC with a graphics processing unit (GPU). Detailed
information on the platform used in this evaluation is listed in Table 1.
TABLE 1
Platform used in the evaluation

OS                     Windows Vista Ultimate
Development platform   Microsoft Visual C++ 2008, OpenCV 1.1pre & NVIDIA CUDA 2.2
Optimization           disabled
CPU                    Intel Core2 Quad Q6600 (2.40GHz)
RAM                    4.0GB
GPU                    NVIDIA GeForce GTX275 x 2 SLI



5.2 Evaluation metric 

As a metric to quantify how well a model predicts the actual human eye focusing positions, we used the
normalized scan-path saliency (NSS) used in previous work [?]. Let $R_j(t)$ be the set of all pixels in a
circular region centered on the eye focusing position of test subject $j$ with a radius of 30 pixels. Then,
the NSS value at time $t$ is defined as

$$NSS(t) = \frac{1}{N_s} \sum_{j=1}^{N_s} \frac{1}{\sigma(p(x))} \left\{ \max_{x(t) \in R_j(t)} p(x(t)) - \bar{p}(x) \right\},$$

where $N_s$ is the total number of subjects, and $\bar{p}(x)$ and $\sigma(p(x))$ are the mean and the
standard deviation of the pixel values of the model's output, respectively. $NSS(t) = 1$ indicates that the
subjects' eye positions fall in a region whose predicted density is one standard deviation above average.
Meanwhile, $NSS(t) \le 0$
indicates that the model performs no better than picking a random position on the map.
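The metric can be computed per frame as in the following sketch. It assumes the model output is a 2D array
and that fixations are given as (row, column) pixel coordinates, one per subject; the function name and
argument layout are illustrative, and a constant map (zero standard deviation) is not handled.

```python
import numpy as np

def nss(density_map, fixations, radius=30):
    """density_map: (H, W) model output for one frame.
    fixations: iterable of (row, col) eye focusing positions, one per subject."""
    mean, std = density_map.mean(), density_map.std()
    rows, cols = np.indices(density_map.shape)
    scores = []
    for r, c in fixations:
        # Circular region R_j(t) around the fixation of subject j.
        region = (rows - r) ** 2 + (cols - c) ** 2 <= radius ** 2
        scores.append((density_map[region].max() - mean) / std)
    return float(np.mean(scores))
```

A map with a single peak gives a large positive score when a fixation lies within the radius of the peak
and a slightly negative score when all fixations lie elsewhere.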



5.3 Results 

We compared our proposed method with 3 existing computational models: 1) a simple control measuring
local pixel variance (denoted "variance") [16], 2) a saliency map (denoted "CIOFM") [10], and 3) Bayesian
surprise (denoted "surprise") [16]. The outputs of all these existing models are included
in the CRCNS eye-1 database, and therefore we used them directly for the evaluation.

Fig. 10 shows the model accuracy measured by the average NSS score with standard errors over all the
video clips, and Fig. 11 details the average NSS score for each video clip. The video clips are
sorted beforehand for readability. The result shown in Fig. 10 indicates that our new method
achieved significantly better scores than all 3 existing methods, which implies that our proposed method
can estimate human visual attention with high accuracy. Also, the result shown in Fig. 11 indicates that
our proposed method scored almost the same as or much better than all the existing methods for most
of the video clips.



Fig. 12 shows snapshots of outputs from the Itti model (the second and fifth rows) and our proposed
method (the third and sixth rows). The outputs from the Itti model included several large
salient regions. On the other hand, the outputs from our proposed method included only a few small eye
focusing areas. This implies that our new method picked up probable eye focusing areas accurately.

Fig. 10. Average NSS score for all the video clips



Fig. 13 shows the total execution time of 1) calculating Itti's saliency map without CUDA, 2) the
proposed method without CUDA, and 3) the proposed method with CUDA. The result indicates that
the proposed method achieved near real-time estimation (40-50 msec/frame), with almost the same
processing time as Itti's model.



6 Conclusion 

We have presented the first stochastic model of human visual attention based on a dynamic Bayesian
framework. Unlike many existing methods, we predict the likelihood of human-attended regions in a
video based on two criteria: 1) the probability of having the maximum saliency response at a given region,
evaluated based on the signal detection theory, and 2) the probability of matching the eye movement
projection based on the predicted state. Experiments have revealed that our model predicts eye gaze
better than previous deterministic models. To enhance our current model, future work may
include determination of initial parameters close to the global optima when estimating model parameters,
a unified approach to estimating all the model parameters, a better density model of eye movements, a
better integration of the bottom-up and the top-down information, a better saliency model for extracting
(deterministic) saliency maps, and integration of the proposed method into applications such as
driving assistance, active vision and video retrieval.

[Figure 11: legend: variance, CIOFM, surprise, proposed model. Annotations: "The proposed method achieved
approx. the same score as others." and "The proposed method achieved much better scores than others."]

Fig. 11. Average NSS score for each video clip, where the vertical axis uses a log scale and video clips on
the horizontal axis are sorted in the ascending order of NSS scores.

Fig. 12. Snapshots of model outputs